A Flash-based dashboard is generated in this HTML file. It can be displayed only when the file is hosted on a web server or placed in a directory that has been added to the trusted sources in the [security settings of Macromedia]. Here are two ways to make sure the dashboard displays correctly.
In Safari, go to ‘Safari Preferences’ -> ‘Security’ -> check ‘Enable JavaScript’ and ‘Allow Plug-ins’ -> ‘Plug-in Settings’ -> check ‘Adobe Flash Player’ -> select ‘on’ in ‘when visiting other websites’ -> ‘done’.
In Chrome, go to ‘Chrome Settings’ -> ‘Advanced’ -> ‘Privacy and Security’ -> ‘Content Settings’ -> ‘Flash’ -> ‘Allow’ -> ‘Add’ -> enter the directory that contains this HTML file.
Sorry for the inconvenience.
This is the final project of 12-709 Data Analytics for Engineered Systems. We are expected to implement an end-to-end data analysis including data collection, data organization, data cleansing, data transformation, data analysis and data visualization.
In this data analysis project, I chose to explore the Gapminder World data. It describes the health and wealth of countries from the mid-20th century to the early 21st century. The dataset covers more than a hundred countries and holds a great deal of information about global health and wealth.
How has the world evolved over time, and what does it look like now? These questions relate not only to the social sciences but also to economics, and they affect engineering development as well. I therefore use this topic to draw stories from the data and to explain the patterns behind them.
There are some key questions we can raise about this topic, and the following tasks are what we explore in this project:
In this section, I identify the datasets used in this final project.
The file project_data.rda contains the following data frames, all of which pertain to global health statistics:
pop.by.age: the population of 138 countries for the years 1950-2050 (using projected populations after 2007), broken up into three age groups (0-19 years, 20-60 years, and 61+ years)
gdp.lifeExp: the per capita GDP (a measure of economic wealth) and life expectancy for these countries, for the years 1952-2007
gdp.lifeExp.small: a small subset of the years in gdp.lifeExp
continents: the continent of each country in the previous datasets
This data was made famous by Hans Rosling (1948-2017) and his Gapminder Foundation. You can see one of his videos here: https://www.youtube.com/watch?v=BPt8ElTQMIg
There are several libraries we need:
library(ggplot2)
library(mclust)
library(plyr)
library(dplyr)
library(reshape2)
library(tidyr)
library(knitr)
library(splines)
library(googleVis)
library(RJSONIO)
Let us load our data first.
load('/Users/apple/Desktop/12709/project/project_data.rda')
Then we need to rename the age group labels into 0-19 Years old, 20-60 Years old and 61+ Years old.
colnames(pop.by.age)=c('country','year','0-19 Years old','20-60 Years old','61+ Years old','continent')
First, we show how the population of each country changes over time in the three age groups. To make countries of very different sizes comparable for plotting and clustering, we scale each age group to its share of the country's total population in that year.
# mutate the total population for each row in the dataset
sum.population=pop.by.age[,c("0-19 Years old","20-60 Years old","61+ Years old")]
sum.population=cbind(sum.population, sum.pop=rowSums(sum.population))
pop.by.age$pop.sum=sum.population$sum.pop
# melt the data to have the age.group label
pop.by.age.melt=melt(pop.by.age,id.vars = c('country','year','continent','pop.sum'))
# rename the label into 'age.group'
colnames(pop.by.age.melt)[5]='age.group'
# mutate the percentage of the population
pop.by.age.melt$percent=pop.by.age.melt$value/pop.by.age.melt$pop.sum
# show how the population of all the countries changes over time grouped by three different age groups
ggplot(data=pop.by.age.melt, mapping=aes(x=year, y=percent,group=country,color=continent)) + geom_line()+facet_wrap('age.group')+labs(x = 'Year', y = 'Percentage', title='Population by Age (in percentage), 1950-2050 \n (All Countries)',caption='Figure 1.1')+theme(plot.title = element_text(hjust = 0.5),plot.caption =element_text(hjust = 0.5))
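As a quick sanity check on the scaling step, the shares of the three age groups should sum to 1 for every country and year. The sketch below illustrates this in base R on a small made-up data frame; the `toy` data and its column names are hypothetical and are not part of project_data.rda.

```r
# Hypothetical toy data: two countries, one year, three age groups (raw counts).
toy <- data.frame(country = c("A", "B"),
                  year    = c(1950, 1950),
                  young   = c(50, 20),
                  adult   = c(40, 60),
                  old     = c(10, 20))

age.cols <- c("young", "adult", "old")

# Row-wise shares: each count divided by the total of its own row.
shares <- prop.table(as.matrix(toy[, age.cols]), margin = 1)

# Every row of shares should add up to 1.
row.totals <- unname(rowSums(shares))
print(shares)
print(row.totals)
```

If any row total differed from 1, the most likely cause would be a missing value or an extra column slipping into the selection.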
We can see from the figure above that countries follow different trends, so let us now use a clustering method to divide the countries into four groups according to how their age structure evolves over time.
First, reshape the dataset into the wide layout that Mclust expects.
# drop the pop.sum and value columns, keeping age.group and percent
pop.by.age.clust=pop.by.age.melt[,-c(4,6)]
# mutate the interaction between the year and the age.group
pop.by.age.clust = mutate(pop.by.age.clust, year.age.group = interaction(year, age.group))
# the spread command works best if there are no extraneous columns, so we only select a subset from the dataset
pop.by.age.clust = subset(pop.by.age.clust, select = c('country', 'year.age.group', 'percent'))
# spread command
pop.by.age.clust.spread = spread(pop.by.age.clust, key = year.age.group, value= percent)
# call the Mclust function and set G to 4 (i.e. 4 clusters)
clust=Mclust(pop.by.age.clust.spread[,1:64],G=4)
# create a new column to add the clustering results
pop.by.age.clust.spread$clust=clust$classification
# change the label names for classification
pop.by.age.clust.spread$clust[pop.by.age.clust.spread$clust==1]='Group 1'
pop.by.age.clust.spread$clust[pop.by.age.clust.spread$clust==2]='Group 2'
pop.by.age.clust.spread$clust[pop.by.age.clust.spread$clust==3]='Group 3'
pop.by.age.clust.spread$clust[pop.by.age.clust.spread$clust==4]='Group 4'
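The reshaping and labeling steps above can also be sketched in base R. The example below is only an illustration under stated assumptions: the `toy` data is made up, and `stats::kmeans` is swapped in for `Mclust` purely so that the sketch runs without extra packages (it is a different clustering method). It keeps the country identifier in the row names rather than in the numeric input, and builds the "Group k" labels with a single `paste` call.

```r
# Hypothetical toy data in long format (not the project data).
toy <- expand.grid(country   = paste0("C", 1:6),
                   year      = c(1950, 2000),
                   age.group = c("young", "old"),
                   stringsAsFactors = FALSE)
young <- c(0.60, 0.62, 0.64, 0.30, 0.32, 0.34,   # shares in 1950
           0.55, 0.57, 0.59, 0.25, 0.27, 0.29)   # shares in 2000
toy$percent <- c(young, 1 - young)

# Wide layout: one row per country, one column per year/age-group pair.
# The country names end up as row names, so only numbers are clustered.
wide <- with(toy, tapply(percent,
                         list(country, interaction(year, age.group)),
                         mean))

# Stand-in clustering step (k-means instead of Mclust), two clusters.
set.seed(42)
fit <- kmeans(wide, centers = 2, nstart = 10)

# Turn the integer cluster ids into labels in one step.
groups <- paste("Group", fit$cluster)
names(groups) <- rownames(wide)
print(groups)
```

The cluster numbering itself is arbitrary, so only the grouping of countries should be interpreted, not which group is called "Group 1".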
Then we can plot the dataset with the clustering results, keeping the previous plot in the background as a reference.
# first make a copy
pop.by.age.melt.2=pop.by.age.melt
# then merge the dataset to add cluster labels to the dataset
label.df=pop.by.age.clust.spread[,c(1,65)]
pop.by.age.melt.2=merge(pop.by.age.melt.2,label.df,by='country')
# show the plot with pooled data
plot=ggplot(data=pop.by.age.melt.2,mapping=aes(x=year,y=percent,group=country,color=continent))+geom_line(data=pop.by.age.melt,mapping=aes(group=country),color='grey',alpha=0.2)+geom_line() +labs(x = 'Year', y = 'Percentage', title='Population by Age (in percentage), 1950-2050 \n (All Countries)',caption='Figure 1.2')+theme(plot.caption=element_text(hjust = 0.5),plot.title = element_text(hjust = 0.5))+facet_grid(clust ~ age.group)
# add the annotation
len=12
vars=data.frame(expand.grid(levels(factor(pop.by.age.melt.2$clust)),levels(pop.by.age.melt.2$age.group)))
colnames(vars)=c('clust','age.group')
dat=data.frame(x=rep(2010,len),y=rep(0.6,len),vars,labs=c('Starts low/Ends low','Starts high/Ends medium','Ends low','Starts high/Ends high','Starts high/Ends low','Starts low/Ends high','Starts medium/Ends medium','Starts low/Ends high','Starts high/Ends high','Starts low/Ends medium','Starts medium/Ends high','Starts low/Ends low'))
dat[1:4,2]=0.1
dat[5:8,2]=0.2
plot+geom_text(aes(x,y,label=labs,group=NULL),color='black',data=dat)
This plot shows how the age demographics change over time for all 138 countries in the dataset, where we have used the Mclust clustering algorithm to divide the countries into four groups (note that the clusters differ slightly from the continents).
The clusters show that the countries can be divided into four groups according to how their age demographics change over time (the trends are summarized by the annotations in each facet).
It is also worth noting that Group 1 contains most of the European countries, where the share of the young population stays lower and the share of the old population stays higher than in other countries. Most countries in Group 4 are African countries, which show the reverse. This can also be explained by the plots in Section 2 and Section 3 and by the analysis of improvements in living quality around the world.
First, let us plot the data for all countries. Since the two variables (lifeExp and gdp.per.capita) are on very different scales, we set the scales to ‘free’ so that each facet gets its own axis.
# melt the data
gdp.lifeExp.melt=melt(gdp.lifeExp,id.vars=c(1,2,5))
# add annotation
dat=data.frame(x=1980,y=90000,variable='gdp.per.capita',labs=c('Kuwait'))
# show the plot for all countries
ggplot(data=gdp.lifeExp.melt, mapping=aes(x=year, y=value,group=country,color=continent)) + geom_line()+facet_wrap('variable',scales = 'free')+labs(x = 'Year', y = 'Value', title='Life Expectancy and GDP per Capita, 1952-2007 \n (All Countries)',caption='Figure 2.1')+theme(plot.title = element_text(hjust = 0.5),plot.caption = element_text(hjust = 0.5))+geom_text(aes(x,y,label=labs,group=NULL),color='black',data=dat)
There is one outlier: Kuwait.
The GDP per capita of Kuwait clearly follows a different pattern from that of other countries. GDP per capita grew over time in most countries, while Kuwait's did not. The Kuwait record may be erroneous, which would explain why it stands out as an outlier. Since clustering aims to group countries with similar changes in both life expectancy and GDP over time, such an atypical series would strongly affect the clustering result, so we remove it from the dataset to build a better clustering model.
# remove Kuwait
gdp.lifeExp=subset(gdp.lifeExp,country!='Kuwait')
gdp.lifeExp.melt=melt(gdp.lifeExp,id.vars=c(1,2,5))
We now use clustering to divide the countries into groups that had similar changes in life expectancy and GDP over time.
After removing Kuwait from the dataset, the clustering results are shown below.
gdp.lifeExp.melt.2= mutate(gdp.lifeExp.melt, year.variable = interaction(year, variable))
# the spread command works best if there are no extraneous columns
gdp.lifeExp.melt.2 = subset(gdp.lifeExp.melt.2, select = c('country', 'year.variable', 'value'))
# spread command
gdp.lifeExp.spread = spread(gdp.lifeExp.melt.2, key = year.variable, value= value)
# call the Mclust function and set G to 3 (i.e. 3 clusters); note that column 1 (country) is part of the input
clust.2=Mclust(gdp.lifeExp.spread[,1:25],G=3)
gdp.lifeExp.spread$clust=clust.2$classification
# change the label names for classification
gdp.lifeExp.spread$clust[gdp.lifeExp.spread$clust==1]='Group 1'
gdp.lifeExp.spread$clust[gdp.lifeExp.spread$clust==2]='Group 2'
gdp.lifeExp.spread$clust[gdp.lifeExp.spread$clust==3]='Group 3'
# melt the data to have the label
label.df.2=gdp.lifeExp.spread[,c(1,26)]
gdp.lifeExp.melt.2=merge(gdp.lifeExp.melt,label.df.2,by='country')
# add annotations
len=6
vars=data.frame(expand.grid(levels(gdp.lifeExp.melt.2$variable),levels(factor(gdp.lifeExp.melt.2$clust))))
colnames(vars)=c('variable','clust')
# create data frame storing the text contents and locations
dat=data.frame(x=rep(1970,len),y=rep(25,len),vars,labs=c('Starts low/Ends low \n with fluctuation','Starts low/Ends low','Starts medium/Ends medium','Starts medium/Ends medium','Starts high/Ends high','with fluctuation'))
# change some locations
dat[2,2]=40000
dat[4,2]=40000
dat[6,2]=40000
# show the plot with pooled data and annotations
ggplot(data=gdp.lifeExp.melt.2,mapping=aes(x=year,y=value, color=continent))+geom_line(data=gdp.lifeExp.melt,mapping=aes(group=country),color='grey',alpha=0.2)+geom_line(aes(group = country)) +labs(x = 'Year', y = 'Value', title='Life Expectancy and GDP per Capita, 1952-2007 \n (All Countries)',caption='Figure 2.2')+theme(plot.title = element_text(hjust = 0.5),plot.caption = element_text(hjust = 0.5))+facet_grid(variable~clust,scales = 'free')+geom_text(aes(x,y,label=labs,group=NULL),color='black',data=dat)
Let us look at some statistics for different clusters.
# analyze the clustering model
clust.summary=summary(clust.2,parameters = T)
# look at the probabilities and means:
kable(data.frame(clust.summary$mean),col.names = c('Group 1','Group 2','Group 3'))
| | Group 1 | Group 2 | Group 3 |
|---|---|---|---|
| country | 69.69647 | 68.58637 | 81.52483 |
| 1952.lifeExp | 38.64181 | 55.50668 | 57.17531 |
| 1957.lifeExp | 40.85091 | 58.14238 | 59.73890 |
| 1962.lifeExp | 42.77892 | 60.36405 | 61.94179 |
| 1967.lifeExp | 45.04716 | 62.18791 | 64.03310 |
| 1972.lifeExp | 47.03081 | 64.09486 | 66.06212 |
| 1977.lifeExp | 48.84440 | 66.07853 | 68.09769 |
| 1982.lifeExp | 51.05258 | 67.85088 | 69.94686 |
| 1987.lifeExp | 52.80174 | 69.40743 | 71.69604 |
| 1992.lifeExp | 53.30340 | 70.68084 | 72.89370 |
| 1997.lifeExp | 53.49926 | 72.03664 | 74.07241 |
| 2002.lifeExp | 53.44693 | 73.32632 | 75.00575 |
| 2007.lifeExp | 55.08611 | 74.42111 | 76.11241 |
| 1952.gdp.per.capita | 1044.26687 | 3780.15323 | 5399.29125 |
| 1957.gdp.per.capita | 1138.25888 | 4486.18583 | 6541.31105 |
| 1962.gdp.per.capita | 1245.20581 | 5184.60953 | 7755.11988 |
| 1967.gdp.per.capita | 1386.38020 | 6150.77321 | 9949.07312 |
| 1972.gdp.per.capita | 1521.91824 | 7365.05568 | 12820.27612 |
| 1977.gdp.per.capita | 1545.57399 | 8366.50534 | 15375.75194 |
| 1982.gdp.per.capita | 1611.44942 | 9033.30778 | 15943.18500 |
| 1987.gdp.per.capita | 1578.89024 | 9731.14666 | 16666.70495 |
| 1992.gdp.per.capita | 1563.39021 | 9893.93017 | 17462.42512 |
| 1997.gdp.per.capita | 1629.79993 | 11015.02676 | 19661.30053 |
| 2002.gdp.per.capita | 1771.59017 | 12154.87238 | 21503.77506 |
| 2007.gdp.per.capita | 2095.14811 | 14202.39739 | 25323.11331 |
By extracting the mean values for each group, we can see that Group 1 contains the countries with the lowest mean life expectancy and GDP per capita in every year, Group 3 contains the countries with the highest means, and the countries in Group 2 lie in between. Group 1 and Group 3 also show some fluctuations over time. The first row of the table (country) arises because the country column was passed to Mclust along with the numeric columns; it carries no meaning.
So far, we have created some plots, but they have several disadvantages:
First, it is hard to match each line to its country in the plots above. Since each cluster contains many countries from different continents, it is also hard to trace the trajectory of each continent.
Second, the full time series gives a detailed, dynamic view of how the countries evolved, but it makes it easy to lose sight of the overall result.
So now let us look only at the data for the first year (1952) and the last year (2007):
# remove Kuwait from the dataset
gdp.lifeExp.small=subset(gdp.lifeExp.small,country!='Kuwait')
# plot the life expectancy against GDP per capita in 1952 and 2007, respectively
ggplot(data=gdp.lifeExp.small,mapping=aes(x=gdp.per.capita,y=lifeExp,color=continent, label=country))+geom_point(size=1)+facet_wrap('year',scales = 'free')+geom_text(size=4,hjust=0, nudge_y =0.1)+labs(x = 'GDP per capita', y = 'Life Expectancy', title='Life Expectancy against GDP per Capita \n (1952 vs. 2007)',caption='Figure 3.1')+theme(plot.title = element_text(hjust = 0.5),plot.caption = element_text(hjust = 0.5))
This plot shows the life expectancy (y-axis) and GDP per capita (x-axis) for each country in both 1952 and 2007.
Each subplot shows that life expectancy at birth increases at a decreasing rate with respect to GDP per capita (PPP).
A likely reason for this non-linear relationship is that people consume both needs and wants. People consume needs in order to survive. Once a person’s needs are satisfied, they can spend the rest of their money on non-necessities. If everyone’s needs are satisfied, then any further increase in GDP per capita would barely affect life expectancy.
Rich people live longer?
The relationship between life expectancy and GDP per capita is strong enough to be the basis of a regression model. Simple functions that increase at a decreasing rate include multiplicative (hyperbolas) and logarithmic functions.
The following is R output for a regression model that was fitted to the data:
# select the data in 2007
regression.data=subset(gdp.lifeExp.small,year==2007,select=c(3,4))
# create non-linear function to fit the data
spline.model = lm(lifeExp ~ ns(gdp.per.capita, df=4), data=regression.data)
# show the summary
summary(spline.model)
##
## Call:
## lm(formula = lifeExp ~ ns(gdp.per.capita, df = 4), data = regression.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.424 -1.663 1.582 4.172 12.035
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.039 1.872 25.122 < 2e-16 ***
## ns(gdp.per.capita, df = 4)1 22.860 2.565 8.911 2.92e-15 ***
## ns(gdp.per.capita, df = 4)2 25.475 3.600 7.076 7.13e-11 ***
## ns(gdp.per.capita, df = 4)3 50.571 4.946 10.225 < 2e-16 ***
## ns(gdp.per.capita, df = 4)4 24.201 4.114 5.883 2.98e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.125 on 136 degrees of freedom
## Multiple R-squared: 0.6622, Adjusted R-squared: 0.6522
## F-statistic: 66.64 on 4 and 136 DF, p-value: < 2.2e-16
Looking at this summary, the regression model includes an intercept and four natural cubic spline basis terms in gdp.per.capita (ns with df = 4), each of which is statistically significant. The model fits the data reasonably well (R-squared of 66.2%). It isn’t necessarily the best model, but it appears to be a fairly good one.
# show the residual plot
plot(spline.model,which=1)
We can see from the residual plot that for fitted life expectancy values between 60 and 70 the residuals lie mostly above 0. The following figure shows how the model fits the data.
ggplot(data=regression.data,mapping=aes(x=gdp.per.capita,y=lifeExp))+
geom_point(size=1)+
geom_smooth(method='lm', formula = y ~ ns(x, df=4))+
labs(x = 'GDP per capita', y = 'Life Expectancy', title='Regression Model of Life Expectancy against GDP per Capita',caption='Figure 3.2')+theme(plot.title = element_text(hjust = 0.5),plot.caption = element_text(hjust = 0.5))
The shaded (grey) area is the confidence band around the fitted curve (geom_smooth draws a 95% interval by default), and we can use this model to predict life expectancy for a given gdp.per.capita.
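Prediction from a natural spline model of this kind can be sketched with `predict` and a `newdata` data frame. The example below is a minimal illustration, not the fitted model from this report: the `toy` data is made up (life expectancy is generated from a logarithmic curve) and the GDP values in `new.countries` are hypothetical.

```r
library(splines)

# Hypothetical toy data (not the project data): life expectancy rises
# with diminishing returns as GDP per capita grows.
toy <- data.frame(gdp.per.capita = c(500, 1000, 2000, 4000, 8000, 12000,
                                     16000, 24000, 32000, 40000, 48000))
toy$lifeExp <- 20 + 6 * log(toy$gdp.per.capita)

# Same model form as in the report: natural cubic spline with df = 4.
toy.model <- lm(lifeExp ~ ns(gdp.per.capita, df = 4), data = toy)

# Predict for two hypothetical GDP per capita values, with a confidence
# interval for the fitted mean (columns fit, lwr, upr).
new.countries <- data.frame(gdp.per.capita = c(3000, 30000))
pred <- predict(toy.model, newdata = new.countries, interval = "confidence")
print(pred)
```

Because the spline is linear beyond its boundary knots, predictions for GDP values outside the observed range should be treated with caution.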
Comparing the two subplots, the axis ranges alone show that the world as a whole had both higher life expectancy and higher GDP per capita in 2007. In 1952, some countries had a life expectancy below 40 years, and GDP per capita was below 15K dollars everywhere. In 2007, almost all countries had a life expectancy above 40 years, and GDP per capita had greatly increased.
Both life expectancy and GDP per capita are increasing worldwide. Most African countries improved life expectancy considerably but still had relatively low GDP per capita; most countries in Asia became both healthier and richer; and Europe maintained both good health and high GDP per capita.
Since many country names overlap on the plot, it is still hard to see each country’s pattern of change. The more detailed plots below address this.
First, let us calculate the total change, expressed as a growth percentage, of life expectancy and of GDP per capita from 1952 to 2007.
# mutate the growth percentages
data.1952=subset(gdp.lifeExp.small,year==1952)
data.2007=subset(gdp.lifeExp.small,year==2007)
lifeExp.change=mutate(data.1952,lifeExp.change=(data.2007$lifeExp-data.1952$lifeExp)/data.1952$lifeExp)
gdp.change=mutate(data.1952,gdp.change=(data.2007$gdp.per.capita-data.1952$gdp.per.capita)/data.1952$gdp.per.capita)
gdp.lifeExp.change=mutate(gdp.change,lifeExp.change=(data.2007$lifeExp-data.1952$lifeExp)/data.1952$lifeExp)
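The computation above subtracts the 1952 and 2007 subsets element by element, which assumes both subsets list the same countries in the same order. A more defensive variant, sketched here in base R on made-up data, joins the two years by country before computing the growth. The `toy` data frame and its values are hypothetical.

```r
# Hypothetical toy data: rows deliberately in a different order per year.
toy <- data.frame(
  country        = c("A", "B", "C", "C", "A", "B"),
  year           = c(1952, 1952, 1952, 2007, 2007, 2007),
  lifeExp        = c(40, 50, 60, 66, 60, 45),
  gdp.per.capita = c(1000, 2000, 4000, 8000, 3000, 1500)
)

keep  <- c("country", "lifeExp", "gdp.per.capita")
start <- toy[toy$year == 1952, keep]
end   <- toy[toy$year == 2007, keep]

# Join on country so each 1952 value is paired with the right 2007 value.
both <- merge(start, end, by = "country", suffixes = c(".1952", ".2007"))

# Relative growth: (end - start) / start.
both$lifeExp.change <- (both$lifeExp.2007 - both$lifeExp.1952) / both$lifeExp.1952
both$gdp.change <- (both$gdp.per.capita.2007 - both$gdp.per.capita.1952) /
  both$gdp.per.capita.1952
print(both[, c("country", "lifeExp.change", "gdp.change")])
```

With the join, a country missing from either year is dropped from the result instead of silently shifting the pairing of the remaining rows.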
Show the plot as below:
ggplot(data=gdp.lifeExp.change,mapping=aes(x=gdp.change,y=lifeExp.change,color=continent,label=country))+geom_point(size=1)+labs(x = 'GDP per capita growth percentage', y = 'Life expectancy growth percentage', title='Life Expectancy Growth Percentage against GDP per Capita Growth Percentage \n (from 1952 to 2007)',caption='Figure 3.3')+theme(plot.title = element_text(hjust = 0.5),plot.caption = element_text(hjust = 0.5))
The plot shows the life expectancy growth percentage against the GDP per capita growth percentage for all countries. Focusing on each continent, we see that Europe had relatively high economic growth but limited life expectancy growth. In contrast to Europe, which had little variance in either growth measure, African and American countries had high variance in life expectancy growth. Asian countries improved both their health and their economies considerably, and some of them grew their economies very quickly.
After the analysis of the whole world and of each continent, let us now look at the trajectory of each country. First, we add some reference lines to the datasets for later comparison.
# reference line of global average growth percentage
lifeExp.change=mutate(lifeExp.change,avg.lifeExp.change=mean(lifeExp.change))
gdp.change=mutate(gdp.change,avg.gdp.change=mean(gdp.change))
# continental average
lifeExp.change=ddply(lifeExp.change,'continent',mutate,avg.lifeExp.change.c=mean(lifeExp.change))
gdp.change=ddply(gdp.change,'continent',mutate,avg.gdp.change.c=mean(gdp.change))
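The same two reference values (a global average and a per-continent average attached to every row) can be sketched in base R with `mean` and `ave`. This is only an illustration on made-up data; the `toy` data frame and the continent labels "X" and "Y" are hypothetical.

```r
# Hypothetical toy data: four countries in two continents.
toy <- data.frame(country    = c("A", "B", "C", "D"),
                  continent  = c("X", "X", "Y", "Y"),
                  gdp.change = c(1, 3, -0.5, 0.5))

# Global average, repeated on every row.
toy$avg.gdp.change <- mean(toy$gdp.change)

# Continental average: ave() applies mean within each continent and
# returns a vector aligned with the original rows.
toy$avg.gdp.change.c <- ave(toy$gdp.change, toy$continent)
print(toy)
```

Unlike a grouped split-and-combine, `ave` keeps the original row order, which is convenient when the result is plotted against another column of the same data frame.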
To get a better and more detailed understanding of each country’s evolution, we may need to show the growth percentages of each country’s GDP per capita and life expectancy separately.
Here is the growth percentages plot of each country’s GDP per capita (ascending order).
ggplot(data=gdp.change,mapping=aes(x=reorder(country,gdp.change),y=gdp.change,color=continent))+
geom_point(size=1.2)+
geom_line(data=gdp.change,mapping=aes(x=country,y=avg.gdp.change,group=1),color='grey')+
geom_line(data=gdp.change,mapping=aes(x=country,y=avg.gdp.change.c,group=1),alpha=0.8)+
facet_wrap('continent',scales = 'free',nrow=3)+
labs(x = 'Country', y = 'Growth Percentage', title='GDP per Capita Growth Percentage \n (from 1952 to 2007)',caption='Figure 4.1')+
theme(axis.text.x=element_text(angle = 90,hjust=1),plot.title = element_text(hjust = 0.5),plot.caption = element_text(hjust = 0.5))
The grey reference lines show the global average growth percentage, and the colored lines show the average for each continent.
Comparing the two reference lines, we can see that Asia and Europe exceeded the global average, while Africa, the Americas and Oceania were below average.
However, several countries go against the overall growth trend. Let us list these countries:
# show a table for countries with decreased GDP per capita
GDP.decrease=subset(gdp.change,gdp.change<0)
GDP.decrease=GDP.decrease[,c(1,5,6)]
kable(GDP.decrease,col.names = c('country','continent','GDP per capita growth percentage'))
| | country | continent | GDP per capita growth percentage |
|---|---|---|---|
| 8 | Central African Republic | Africa | -0.3409787 |
| 10 | Comoros | Africa | -0.1059329 |
| 11 | Congo, Dem. Rep. | Africa | -0.6444115 |
| 14 | Djibouti | Africa | -0.2199069 |
| 26 | Liberia | Africa | -0.2798353 |
| 28 | Madagascar | Africa | -0.2759795 |
| 36 | Niger | Africa | -0.1866470 |
| 42 | Sierra Leone | Africa | -0.0196036 |
| 43 | Somalia | Africa | -0.1845554 |
| 65 | Haiti | Americas | -0.3470665 |
| 69 | Nicaragua | Americas | -0.1166454 |
We could see from the table that most of the countries with negative growth of economy were African countries.
Now let us plot the growth percentages of each country’s life expectancy (ascending order).
ggplot(data=lifeExp.change,mapping=aes(x=reorder(country,lifeExp.change),y=lifeExp.change,color=continent))+
geom_point(size=1.2)+
geom_line(data=lifeExp.change,mapping=aes(x=country,y=avg.lifeExp.change,group=1),color='grey')+
geom_line(data=lifeExp.change,mapping=aes(x=country,y=avg.lifeExp.change.c,group=1),alpha=0.8)+
facet_wrap('continent',scales = 'free',nrow=3)+
labs(x = 'Country', y = 'Growth Percentage', title='Life Expectancy Growth Percentage \n (from 1952 to 2007)',caption='Figure 4.2')+
theme(axis.text.x=element_text(angle = 90,hjust=1),plot.title = element_text(hjust = 0.5),plot.caption = element_text(hjust = 0.5))
We can see from the plot that life expectancy is also increasing worldwide. Africa, the Americas and Asia exceeded the global average, while Europe and Oceania were below average.
Similarly, let us list the countries with negative growth percentage:
# show a table for countries with decreased life expectancy
lifeExp.decrease=subset(lifeExp.change,lifeExp.change<0)
lifeExp.decrease=lifeExp.decrease[,c(1,5,6)]
kable(lifeExp.decrease,col.names = c('country','continent','life expectancy growth percentage'))
| | country | continent | life expectancy growth percentage |
|---|---|---|---|
| 46 | Swaziland | Africa | -0.043326 |
| 52 | Zimbabwe | Africa | -0.102454 |
The table shows that all of the countries with negative growth of life expectancy were African countries.
In this section, I design an interactive dashboard that shows the time series data and analysis using the googleVis library.
# set option to show the chart only
op <- options(gvis.plot.tag='chart')
# set options back to original options
options(op)
Together, these motion charts serve as an interactive dashboard for exploring the time series data. Users can choose the variables for the x-axis and y-axis, and can also choose the variables for Color (such as continent) and Size (such as population). With Select you can show only certain countries, and by checking the Trails option you can see the track of a country over time. In addition, if you select gdp.per.capita for the x-axis and life.expectancy for the y-axis, and choose “Log” instead of “Lin”, you should get a plot similar to the one above, showing the logarithmic relationship between the two variables.
This project explores several questions about global health and wealth. I believe the findings help us better understand what the world has looked like over the past century and how it has evolved, so that we can learn from history and improve in the future.
In Section 1, we find four clusters (shown in Figure 1.2) among the countries of the world according to how their populations evolve over the entire time span. Demographic change may result from global economic development and improved living quality. We therefore study the improvement in life expectancy and GDP per capita for each continent in Section 2 and Section 3, and for each country in Section 4.
From the plots in Section 2 and Section 3, we can say that the overall levels of both health and economy in the world were rising from 1952 to 2007. But some individual countries went against this global pattern (as listed in the tables in Section 4). Some countries (mostly in Africa) showed negative economic growth and some (all in Africa) even suffered a decline in life expectancy, but no country showed negative growth in both life expectancy and GDP per capita.
In Section 4, Figure 4.1 shows that Europe and Asia made a large contribution to global economic growth; the economic rise of Asia was especially remarkable. As for life expectancy, Asia again played the most important role in improving global health (shown in Figure 4.2). We may assume that positive economic growth raises the living standard to some degree and results in positive growth of life expectancy. Indeed, Figure 3.1 shows that, compared with 1952, the world in 2007 forms a continuum: high income with high life expectancy, middle income with middle life expectancy, low income with low life expectancy. The world has become healthier and richer, but the gap between the richest and the poorest has grown even wider.
The raw dataset in this project was not fully consistent or “clean”. In some places I simply removed NA values, but there may be better ways to deal with these problems. I would like to learn more about data cleansing and how to process inconsistent data.
In addition, I have not yet found a good way to generate an infographic directly from R. I have tried my best to add text annotations to the figures to make them more like an infographic. In the future I would like to learn more about creating infographics in R.
In this project, I use R for all of the work, including reading data, data cleansing, data transformation, data analysis, data visualization, creating dashboards and generating the documentation. Some tasks might be easier with other tools, such as creating dashboards in Tableau or adding text to figures in PowerPoint. Since I prefer to do all of the work in code (because code is reusable in the future), I have tried my best to take advantage of different R libraries. However, it would also be worthwhile and time-saving to use different tools, including SQL, Excel, Tableau and PowerPoint, and to combine their output to complete a whole project.